Marginalizing Corrupted Features

Authors

  • Laurens van der Maaten
  • Minmin Chen
  • Stephen Tyree
  • Kilian Q. Weinberger
Abstract

The goal of machine learning is to develop predictors that generalize well to test data. Ideally, this is achieved by training on an almost infinitely large training data set that captures all variations in the data distribution. In practical learning settings, however, we do not have infinite data and our predictors may overfit. Overfitting may be combatted, for example, by adding a regularizer to the training objective or by defining a prior over the model parameters and performing Bayesian inference. In this paper, we propose a third, alternative approach to combat overfitting: we extend the training set with infinitely many artificial training examples that are obtained by corrupting the original training data. We show that this approach is practical and efficient for a range of predictors and corruption models. Our approach, called marginalized corrupted features (MCF), trains robust predictors by minimizing the expected value of the loss function under the corruption model. We show empirically on a variety of data sets that MCF classifiers can be trained efficiently, may generalize substantially better to test data, and are also more robust to feature deletion at test time.
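
To make the last point concrete: for some combinations of loss function and corruption model, the expectation over the infinitely many corrupted copies has a closed form, so corrupted examples never need to be generated explicitly. The sketch below is an illustration of that idea rather than code from the paper; it assumes a linear predictor with quadratic loss and blankout (dropout-style) corruption that zeroes each feature independently with probability q, in which case the expected loss equals the loss on the scaled inputs plus a data-dependent quadratic penalty on the weights.

    import numpy as np

    def expected_quadratic_loss(w, X, y, q):
        # Expected quadratic loss under blankout corruption: each feature is
        # kept with probability 1 - q and set to 0 with probability q, so
        # E[(w.x_corr - y)^2] = ((1-q) w.x - y)^2 + q (1-q) sum_d w_d^2 x_d^2.
        mean_pred = (1.0 - q) * (X @ w)                   # E[w . x_corr] per example
        variance = q * (1.0 - q) * ((X ** 2) @ (w ** 2))  # Var[w . x_corr] per example
        return np.mean((mean_pred - y) ** 2 + variance)

    def expected_loss_gradient(w, X, y, q):
        # Gradient of the expected loss above with respect to w.
        n = X.shape[0]
        residual = (1.0 - q) * (X @ w) - y
        grad = 2.0 * (1.0 - q) * (X.T @ residual) / n
        grad += 2.0 * q * (1.0 - q) * (X ** 2).sum(axis=0) * w / n
        return grad

    # Toy usage: fit the weights by plain gradient descent on the expected loss.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 20))
    y = X @ rng.normal(size=20) + 0.1 * rng.normal(size=200)
    w = np.zeros(20)
    for _ in range(500):
        w -= 0.05 * expected_loss_gradient(w, X, y, q=0.3)
    print(expected_quadratic_loss(w, X, y, q=0.3))

The penalty term acts as a regularizer whose strength depends on the data and on q, which is one way to read the claim that the corrupted copies combat overfitting.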

Similar articles

Marginalizing stacked linear denoising autoencoders

Stacked denoising autoencoders (SDAs) have been successfully used to learn new representations for domain adaptation. They have attained record accuracy on standard benchmark tasks of sentiment analysis across different text domains. SDAs learn robust data representations by reconstruction, recovering original features from data that are artificially corrupted with noise. In this paper, we prop...
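
The excerpt above is truncated; as a rough illustration of what marginalizing the corruption buys in this setting, the following sketch (my own, not the authors' released code) computes the linear denoising map of a single marginalized layer in closed form. It assumes blankout corruption with probability p and minimizes the expected squared error of reconstructing the clean inputs, so only expectations of scatter matrices are needed and no corrupted copies of the data are ever created.

    import numpy as np

    def marginalized_denoising_layer(X, p):
        # X: (d, n) matrix with one example per column; p: blankout probability.
        d, n = X.shape
        Xb = np.vstack([X, np.ones((1, n))])   # corrupted input gets a constant bias feature
        q = np.full(d + 1, 1.0 - p)            # survival probability of each input feature
        q[-1] = 1.0                            # the bias feature is never corrupted

        S = Xb @ Xb.T                          # (d+1, d+1) scatter matrix of the clean data
        P = S[:d, :] * q                       # E[X Xc^T]: column j is scaled by q_j
        Q = S * np.outer(q, q)                 # E[Xc Xc^T]: off-diagonal entries
        np.fill_diagonal(Q, np.diag(S) * q)    # diagonal entries are corrupted only once

        # Closed-form reconstruction map W = P Q^{-1}; the small ridge term is
        # only for numerical stability.
        W = np.linalg.solve(Q + 1e-5 * np.eye(d + 1), P.T).T
        return np.tanh(W @ Xb)                 # (d, n) hidden representation

Stacking several such layers, each trained on the output of the previous one, yields a marginalized counterpart of an SDA without any iterative optimization.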

Denoising Criterion for Variational Auto-Encoding Framework

Denoising autoencoders (DAE) are trained to reconstruct their clean inputs with noise injected at the input level, while variational autoencoders (VAE) are trained with noise injected in their stochastic hidden layer, with a regularizer that encourages this noise injection. In this paper, we show that injecting noise both in input and in the stochastic hidden layer can be advantageous and we pr...

Learning Bilingual Word Representations by Marginalizing Alignments

We present a probabilistic model that simultaneously learns alignments and distributed representations for bilingual data. By marginalizing over word alignments the model captures a larger semantic context than prior work relying on hard alignments. The advantage of this approach is demonstrated in a cross-lingual classification task, where we outperform the prior published state of the art.

Feature Analysis of Chromatic or Achromatic Components based on Tensor Voting and Text Segmentation using Separated Clustering Algorithm

This paper presents a new technique for segmenting corrupted text images on the basis of color feature analysis using second-order tensors. It is shown how feature analysis can benefit from analyzing features with second-order tensors that have chromatic and achromatic components. The proposed technique is applied to text images corrupted by various types of noise. First, we decompose an image i...

Combining a Gaussian Mixture Model Fro

Fitting a Gaussian mixture model (GMM) to the smoothed speech spectrum allows an alternative set of features to be extracted from the speech signal. These features have been shown to possess information complementary to the standard MFCC parameterisation. This paper further investigates the use of these GMM features in combination with MFCCs. The extraction and use of a confidence metric to com...

Journal:
  • CoRR

Volume: abs/1402.7001

Pages: –

Publication date: 2014